PCA

wine.data = read.csv('wine.csv')
wine.data$Type = as.factor(wine.data$Type)
scaled.wine.data = data.frame(scale(wine.data[,-c(1)]))

pca = princomp(scaled.wine.data)
plot(pca)

plot(cumsum(pca$sdev^2/sum(pca$sdev^2)))

pca$loadings
## 
## Loadings:
##                              Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## Alcohol                       0.144  0.484  0.207         0.266  0.214       
## Malic.acid                   -0.245  0.225        -0.537         0.537 -0.421
## Ash                                  0.316 -0.626  0.214  0.143  0.154  0.149
## Alcalinity.of.ash            -0.239        -0.612               -0.101  0.287
## Magnesium                     0.142  0.300 -0.131  0.352 -0.727        -0.323
## Total.phenols                 0.395        -0.146 -0.198  0.149              
## Flavanoids                    0.423        -0.151 -0.152  0.109              
## Nonflavanoid.phenols         -0.299        -0.170  0.203  0.501 -0.259 -0.595
## Proanthocyanins               0.313        -0.149 -0.399 -0.137 -0.534 -0.372
## Color.intensity                      0.530  0.137               -0.419  0.228
## Hue                           0.297 -0.279         0.428  0.174  0.106 -0.232
## OD280.OD315.of.diluted.wines  0.376 -0.164 -0.166 -0.184  0.101  0.266       
## Proline                       0.287  0.365  0.127  0.232  0.158  0.120       
##                              Comp.8 Comp.9 Comp.10 Comp.11 Comp.12 Comp.13
## Alcohol                       0.396  0.509  0.212   0.226   0.266         
## Malic.acid                                 -0.309          -0.122         
## Ash                          -0.170 -0.308          0.499          -0.141 
## Alcalinity.of.ash             0.428  0.200         -0.479                 
## Magnesium                    -0.156  0.271                                
## Total.phenols                -0.406  0.286 -0.320  -0.304   0.304  -0.464 
## Flavanoids                   -0.187        -0.163                   0.832 
## Nonflavanoid.phenols         -0.233  0.196  0.216  -0.117           0.114 
## Proanthocyanins               0.368 -0.209  0.134   0.237          -0.117 
## Color.intensity                            -0.291          -0.604         
## Hue                           0.437        -0.522          -0.259         
## OD280.OD315.of.diluted.wines         0.137  0.524          -0.601  -0.157 
## Proline                       0.120 -0.576  0.162  -0.539                 
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.077  0.077  0.077  0.077  0.077  0.077  0.077  0.077  0.077
## Cumulative Var  0.077  0.154  0.231  0.308  0.385  0.462  0.538  0.615  0.692
##                Comp.10 Comp.11 Comp.12 Comp.13
## SS loadings      1.000   1.000   1.000   1.000
## Proportion Var   0.077   0.077   0.077   0.077
## Cumulative Var   0.769   0.846   0.923   1.000
pc = prcomp(scaled.wine.data)
plot(pc)

plot(cumsum(pc$sdev^2/sum(pc$sdev^2)))

# First 7 principal components
comp <- data.frame(pc$x[,1:7])
# Plot
plot(comp, pch=16, col=rgb(0,0,0,0.5))

Questions

  1. Apply PCA to the “wine” dataset with all numerical variables and reduce the dimension.

  2. How many principal components are needed to explain more than 85% of the variance?

I used 7 principal components, which account for roughly 90% of the variance.
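The component count can be read off programmatically rather than from the plot. A minimal sketch of the idiom, shown on the built-in iris measurements so it runs standalone; the same pattern applies to the wine PCA above:

```r
# Smallest number of components whose cumulative proportion of
# variance exceeds a threshold (illustrated on built-in iris data).
pc_demo <- prcomp(scale(iris[, 1:4]))
var_explained <- pc_demo$sdev^2 / sum(pc_demo$sdev^2)
n_components <- which(cumsum(var_explained) > 0.85)[1]
n_components
```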

  3. Is there any cluster formed by the different types of wine? If so, how many clusters can you see?
library(ggfortify)   # provides autoplot() for prcomp objects
autoplot(pc, data = wine.data, colour = 'Type')

There appear to be three clusters, which correspond to the three Types. A few points are mixed, but overall the separation is quite good in just two dimensions.

Classification

MLR

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::combine() masks gridExtra::combine()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## Registered S3 method overwritten by 'parsnip':
##   method          from     
##   autoplot.glmnet ggfortify
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.4     ✔ workflows    1.1.3
## ✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.2.0
## ✔ recipes      1.0.8     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ dplyr::combine()        masks gridExtra::combine()
## ✖ scales::discard()       masks purrr::discard()
## ✖ dplyr::filter()         masks stats::filter()
## ✖ recipes::fixed()        masks stringr::fixed()
## ✖ dplyr::lag()            masks stats::lag()
## ✖ rsample::permutations() masks e1071::permutations()
## ✖ dials::prune()          masks rpart::prune()
## ✖ yardstick::spec()       masks readr::spec()
## ✖ recipes::step()         masks stats::step()
## ✖ tune::tune()            masks parsnip::tune(), e1071::tune()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following objects are masked from 'package:yardstick':
## 
##     precision, recall, sensitivity, specificity
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(nnet)
# Split the data into training and test set
set.seed(123)
pca.data = cbind(wine.data$Type, comp)
names(pca.data)[1] = 'Type'
training.samples <- wine.data$Type %>% 
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- pca.data[training.samples, ]
test.data <- pca.data[-training.samples, ]

mlr = multinom_reg() |> 
  fit(Type~., data = train.data)

# Summarize the model
tidy(mlr, exponentiate = TRUE, conf.int = TRUE) |> 
  mutate_if(is.numeric, round, 4) |> 
  select(-std.error, -statistic)
## # A tibble: 16 × 6
##    y.level term        estimate p.value conf.low  conf.high
##    <chr>   <chr>          <dbl>   <dbl>    <dbl>      <dbl>
##  1 2       (Intercept) 4.48e+ 0   0.999        0 Inf       
##  2 2       PC1         1.00e+ 4   0.980        0 Inf       
##  3 2       PC2         1.81e+12   0.992        0 Inf       
##  4 2       PC3         2.07e+ 2   0.992        0 Inf       
##  5 2       PC4         3.00e- 1   0.994        0   8.19e128
##  6 2       PC5         0          0.901        0   1.30e 72
##  7 2       PC6         3.82e+ 1   0.995        0 Inf       
##  8 2       PC7         3.11e+ 1   0.999        0 Inf       
##  9 3       (Intercept) 3   e- 4   0.994        0 Inf       
## 10 3       PC1         4.13e+ 7   0.966        0 Inf       
## 11 3       PC2         0          0.997        0 Inf       
## 12 3       PC3         2.28e- 2   0.994        0 Inf       
## 13 3       PC4         1.83e- 1   0.993        0   4.74e162
## 14 3       PC5         3   e- 4   0.989        0 Inf       
## 15 3       PC6         9.88e+ 3   0.988        0 Inf       
## 16 3       PC7         0          0.997        0 Inf
Type_preds <- mlr |> 
  augment(new_data = test.data)

conf_mat(Type_preds, truth = Type, estimate = .pred_class)
##           Truth
## Prediction  1  2  3
##          1 11  0  0
##          2  0 14  0
##          3  0  0  9
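The near-unit p-values and (0, Inf) confidence intervals in the coefficient table above are a symptom of (quasi-)complete separation: the classes are almost perfectly separable in PC space, so the coefficients are poorly pinned down even though the predictions are excellent. Overall accuracy can be read directly off the confusion matrix (yardstick's `accuracy()` would give the same number from the predictions):

```r
# Overall accuracy from the confusion matrix above: correct
# predictions sit on the diagonal.
cm <- matrix(c(11,  0, 0,
                0, 14, 0,
                0,  0, 9), nrow = 3, byrow = TRUE)
sum(diag(cm)) / sum(cm)  # 1: every test wine is classified correctly
```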
roc_auc(Type_preds, truth = Type, .pred_1, .pred_2, .pred_3)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc hand_till          1
roc_curve(Type_preds, truth = Type, .pred_1, .pred_2, .pred_3) |> 
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = .level)) +
  geom_line(linewidth = 1, alpha = 0.7) +
  geom_abline(slope = 1, linetype = "dotted") +
  coord_fixed() +
  labs(color = NULL) +
  theme_light()

possibilities <- expand_grid(
  PC1 = seq(-4.31, 4.27, length.out = 10),
  PC2 = seq(-3.51, 3.87, length.out = 10),
  PC3 = seq(-4.57, 5.33, length.out = 10),
  PC4 = seq(-2.88, 3.78, length.out = 10),
  PC5 = seq(-4.18, 2.02, length.out = 10),
  PC6 = seq(-1.91, 3.28, length.out = 10),
  PC7 = seq(-1.63, 2.64, length.out = 10)
)

possibilities <- bind_cols(possibilities,
                           predict(mlr, new_data = possibilities))

possibilities |> 
  ggplot(aes(x = PC1, y = PC2)) +
  geom_point(aes(color = .pred_class), alpha = 0.1) +
  geom_point(data = test.data, aes(color = Type, shape = Type),
             size = 2,
             alpha = 0.8) +
  labs(color = "Type", shape = "Type") +
  theme_light()
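Note that the grid above has 10^7 rows (10 values in each of 7 dimensions), which is expensive to predict on. A lighter alternative (my own sketch, not the original approach) varies only the two plotted components and holds the remaining ones at 0, their mean after centering:

```r
# 2-D decision-region grid: vary PC1 and PC2, fix PC3..PC7 at 0.
grid2d <- expand.grid(PC1 = seq(-4.31, 4.27, length.out = 100),
                      PC2 = seq(-3.51, 3.87, length.out = 100))
grid2d[c("PC3", "PC4", "PC5", "PC6", "PC7")] <- 0
nrow(grid2d)  # 10,000 rows instead of 10,000,000
```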

SVM

svm_rbf_spec <- svm_rbf() %>%
  set_mode("classification") %>%
  set_engine("kernlab", scaled = FALSE)

svm_rbf_fit <- svm_rbf_spec %>% 
  fit(Type ~ ., data = train.data)

svm_rbf_fit
## parsnip model object
## 
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0829987516109757 
## 
## Number of Support Vectors : 54 
## 
## Objective Function Value : -11.6864 -4.0413 -11.2833 
## Training error : 0.006944 
## Probability model included.
augment(svm_rbf_fit, new_data = test.data) %>%
  conf_mat(truth = Type, estimate = .pred_class)
##           Truth
## Prediction  1  2  3
##          1 11  0  0
##          2  0 14  1
##          3  0  0  8
augment(svm_rbf_fit, new_data = test.data) %>%
  roc_curve(truth = Type, .pred_1, .pred_2, .pred_3) %>%
  autoplot()

augment(svm_rbf_fit, new_data = test.data) %>%
  roc_auc(truth = Type, .pred_1, .pred_2, .pred_3)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc hand_till          1
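For reference, the Gaussian kernel kernlab reports above is k(x, y) = exp(-sigma * ||x - y||^2), with sigma estimated at about 0.083 here. A one-line sketch of the kernel itself (not kernlab's implementation):

```r
# Gaussian (RBF) kernel: similarity decays with squared distance.
rbf_kernel <- function(x, y, sigma) exp(-sigma * sum((x - y)^2))
rbf_kernel(c(0, 0), c(0, 0), sigma = 0.083)  # identical points give 1
```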

KNN

#make a knn spec
knn_spec <- nearest_neighbor() %>% 
  set_engine("kknn") %>% 
  set_mode("classification")

#use the knn spec to fit the pre-processed training data
knn_fit <- knn_spec %>% 
  fit(Type ~., data = train.data)
## 
## Attaching package: 'kknn'
## The following object is masked from 'package:caret':
## 
##     contr.dummy
augment(knn_fit, new_data = test.data) %>%
  conf_mat(truth = Type, estimate = .pred_class)
##           Truth
## Prediction  1  2  3
##          1 11  0  0
##          2  0 13  0
##          3  0  1  9
augment(knn_fit, new_data = test.data) %>%
  roc_curve(truth = Type, .pred_1, .pred_2, .pred_3) %>%
  autoplot()

augment(knn_fit, new_data = test.data) %>%
  roc_auc(truth = Type, .pred_1, .pred_2, .pred_3)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc hand_till          1
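kknn fits a distance-weighted k-NN; the unweighted 1-NN core of the idea fits in a few lines of base R (a sketch of the technique, not kknn's actual code):

```r
# 1-NN: predict the label of the single closest training point.
nn1 <- function(train_x, train_y, new_x) {
  diffs <- sweep(train_x, 2, new_x)   # subtract new_x from each row
  d <- sqrt(rowSums(diffs^2))         # Euclidean distances
  train_y[which.min(d)]
}
tx <- rbind(c(0, 0), c(5, 5))
nn1(tx, c("a", "b"), c(0.4, 0.2))  # "a": nearest to (0, 0)
```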

CART

tree_spec <- decision_tree() %>%
  set_engine("rpart")

class_tree_spec <- tree_spec %>%
  set_mode("classification")

class_tree_fit <- class_tree_spec %>%
  fit(Type ~ ., data = train.data)

library(rpart.plot)

class_tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)  # roundint = FALSE avoids the warning about
                                # rpart.plot not seeing the training data

augment(class_tree_fit, new_data = test.data) %>%
  conf_mat(truth = Type, estimate = .pred_class)
##           Truth
## Prediction  1  2  3
##          1 10  2  0
##          2  1 12  0
##          3  0  0  9
augment(class_tree_fit, new_data = test.data) %>%
  roc_curve(truth = Type, .pred_1, .pred_2, .pred_3) %>%
  autoplot()

augment(class_tree_fit, new_data = test.data) %>%
  roc_auc(truth = Type, .pred_1, .pred_2, .pred_3)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc hand_till      0.953
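By default rpart grows a classification tree by greedily choosing splits that reduce Gini impurity. The measure itself is simple to state and compute:

```r
# Gini impurity of a node: 1 - sum(p_k^2) over the class proportions.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}
gini(c("a", "a", "a", "a"))  # pure node -> 0
gini(c("a", "a", "b", "b"))  # maximally mixed 2-class node -> 0.5
```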

RF

#make a rf spec
rf_spec <- rand_forest() %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

#use the rf spec to fit the pre-processed training data
rf_fit <- rf_spec %>% 
  fit(Type ~., data = train.data)

augment(rf_fit, new_data = test.data) %>%
  conf_mat(truth = Type, estimate = .pred_class)
##           Truth
## Prediction  1  2  3
##          1 10  0  0
##          2  1 14  0
##          3  0  0  9
augment(rf_fit, new_data = test.data) %>%
  roc_curve(truth = Type, .pred_1, .pred_2, .pred_3) %>%
  autoplot()

augment(rf_fit, new_data = test.data) %>%
  roc_auc(truth = Type, .pred_1, .pred_2, .pred_3)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc hand_till      0.997
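Each tree in the ranger forest is grown on a bootstrap sample of the training rows. A quick check of the well-known result that roughly 1 - 1/e (about 63%) of the distinct rows appear in each tree's sample:

```r
# Bootstrap sampling: draw n rows with replacement and count how many
# distinct rows appear.
set.seed(1)
n <- 10000
boot_idx <- sample(n, n, replace = TRUE)
length(unique(boot_idx)) / n  # close to 1 - exp(-1), about 0.632
```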

Clustering

## Libraries for Plotting our Results
library(ggplot2)
library(ggfortify)
library(gridExtra)

library(GGally)      # some nice plot extensions to ggplot
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(cowplot)     # allows us to create matrix plots
## 
## Attaching package: 'cowplot'
## The following object is masked from 'package:lubridate':
## 
##     stamp
library(varhandle)   # provides unfactor() function
library(cluster)     # provides bottom-up clustering support
library(factoextra)  # support for plotting of clusters
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(dbscan)      # support for the dbscan algorithm
## 
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
## 
##     as.dendrogram
set.seed(125)        # so we get consistent results for rendering this tutorial

diabetes_data <- read.csv("diabetes.csv", sep=",")
colnames(diabetes_data)<-c("Pregnancies","Glucose","BloodPressure","SkinThickness","Insulin","BMI","PedigreeFunction","Age","Outcome")
diabetes_data$Outcome <- as.factor(diabetes_data$Outcome)
summary(diabetes_data)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        PedigreeFunction      Age        Outcome
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00   0:500  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00   1:268  
##  Median : 30.5   Median :32.00   Median :0.3725   Median :29.00          
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24          
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00          
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00
diabetes_matrix = data.frame(scale(diabetes_data[, -c(9)]))

pc = prcomp(diabetes_matrix)

# Plot of Variance Explained
plot(cumsum(pc$sdev^2/sum(pc$sdev^2)), main = 'Cumulative Variance Explained by Components', 
     ylab = "Cumulative proportion explained", xlab = "Component")

pc_plot1 <- autoplot(pc, data = diabetes_data, main = "Plot of Diabetes Data in Principal Components Space")
pc_plot2 <- autoplot(pc, data = diabetes_data, colour = "Outcome", main = "Plot of Diabetes Data in Principal Components Space") 

plot_grid(                                    # uses cowplot library to arrange grid
  pc_plot1, pc_plot2, 
  nrow = 1
)

### K means

# Set a maximum number of groups to explore
k.max <- 10

# Fit a kmeans cluster for each number of groups and calculate within sum of squares
wss <- sapply(1:k.max, function(k){kmeans(diabetes_matrix, k)$tot.withinss})
wss
##  [1] 6136.000 5144.194 4354.122 4090.571 3621.022 3357.598 3134.841 3000.041
##  [9] 3005.599 2759.770
# Plot the results so we can find the "elbow"
plot(1:k.max, wss,
     type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares")

# Apply silhouette method to determine clusters
fviz_nbclust(diabetes_matrix, kmeans, method = "silhouette")

# Calculate the K-Means Clusters 
km.out <- kmeans(diabetes_matrix, 2, nstart=20)
km.out
## K-means clustering with 2 clusters of sizes 272, 496
## 
## Cluster means:
##   Pregnancies    Glucose BloodPressure SkinThickness     Insulin         BMI
## 1   0.9581251  0.4333542     0.4065341   -0.14287076 -0.02518500  0.11840510
## 2  -0.5254234 -0.2376459    -0.2229381    0.07834848  0.01381113 -0.06493183
##   PedigreeFunction        Age
## 1       0.03157570  1.0489862
## 2      -0.01731571 -0.5752505
## 
## Clustering vector:
##   [1] 1 2 1 2 2 2 2 2 1 1 2 1 1 1 1 2 2 1 2 2 2 1 1 1 1 1 1 2 1 1 1 2 2 2 1 2 1
##  [38] 1 2 1 2 1 1 1 1 2 2 2 2 2 2 2 2 1 1 2 1 2 1 2 2 1 2 2 1 2 2 1 2 2 2 2 1 2
##  [75] 2 2 1 2 2 2 2 2 1 2 1 2 1 2 1 2 2 2 1 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 1 2 2 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 1 1 1 2 1 2 2 2 2 2 2 1 1 2 1 2 2 1 2
## [149] 1 2 2 2 1 2 1 1 2 2 2 1 1 1 2 2 2 1 2 2 2 2 1 2 2 2 2 1 1 2 1 1 2 2 2 2 1
## [186] 1 1 2 1 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 1 2 1 1 2 1 2 2 1 2 1 1 2 2 2 1 2 1
## [223] 2 1 2 2 2 2 1 2 2 1 2 2 2 2 1 2 1 2 2 2 2 2 2 1 1 2 1 2 1 2 2 2 1 2 2 2 2
## [260] 1 1 2 2 1 2 1 2 2 2 2 1 2 1 2 1 2 2 2 1 2 2 1 1 1 1 1 1 2 2 2 2 2 2 2 1 2
## [297] 2 2 1 1 2 2 2 1 1 2 1 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2 1 1 2 2
## [334] 1 2 2 2 1 1 1 2 2 2 1 1 1 2 2 2 2 2 2 2 2 2 1 2 1 1 2 1 1 1 1 2 2 2 2 2 1
## [371] 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 1 1 2 2 1 2 2 1 2 2 2 2 2 2 1 1 1 1 2 1
## [408] 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2 1 1 2 2 1
## [445] 2 2 2 2 2 2 2 2 2 1 2 1 1 2 1 1 1 2 1 2 1 2 2 2 2 1 2 2 2 1 2 1 2 1 1 1 2
## [482] 2 2 2 2 2 2 1 2 1 2 2 2 1 2 1 2 2 1 1 2 2 2 1 2 1 2 2 2 1 1 2 1 2 2 2 1 1
## [519] 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 1 2 2 1 1 2 1 1 2 2 1 2 2
## [556] 1 2 1 1 1 1 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 1 1 2 2 1 1 1 2 1 2 1 2 1 2
## [593] 1 2 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 1 2 1 2 1 2 2 2 1 2 2 2 2 2 1
## [630] 2 1 2 2 2 1 1 1 2 1 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 1 2 1 1 1 2
## [667] 1 1 1 1 1 2 1 2 1 1 1 2 2 2 2 2 2 2 1 2 2 2 2 1 1 1 2 1 2 1 2 2 2 2 2 1 1
## [704] 2 2 2 2 2 1 2 2 1 1 2 2 1 2 1 2 1 2 2 2 1 1 2 2 2 2 2 2 1 2 2 1 2 2 1 2 2
## [741] 1 2 2 1 1 1 2 2 1 1 2 2 2 2 1 2 1 1 2 1 2 1 1 1 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 2014.888 3107.164
##  (between_SS / total_SS =  16.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
# Plot of assigned clusters
fviz_cluster(list(data = diabetes_matrix, cluster = km.out$cluster))

# Comparing within sum of squares to between sum of squares
# Between Cluster Sum of Squares 
km.out$betweenss
## [1] 1013.948
# Within Cluster Sum of Squares
km.out$withinss
## [1] 2014.888 3107.164
# The Ratio
km.out$withinss/km.out$betweenss
## [1] 1.987171 3.064422
# Compare our results from km.out$cluster with the known values
results.compare<-data.frame(cbind(as.character(diabetes_data$Outcome), km.out$cluster))
table(results.compare)
##    X2
## X1    1   2
##   0 127 373
##   1 145 123
kmeans_accuracy <- (373+145)/768    # values come from the table above; they can change between runs of the algorithm
kmeans_accuracy
## [1] 0.6744792
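One caveat with this accuracy: k-means labels are arbitrary (cluster "1" need not correspond to Outcome "1"), so the accuracy should be computed under the better of the two possible label matchings, which is what the hand-picked cells above implicitly do. A small helper for the two-cluster case (my own sketch, not part of the original analysis):

```r
# Accuracy of a 2-cluster solution against a binary truth, taking the
# better of the two possible cluster-to-label assignments.
cluster_accuracy <- function(truth, cluster) {
  tab <- table(truth, cluster)
  max(sum(diag(tab)), tab[1, 2] + tab[2, 1]) / sum(tab)
}
cluster_accuracy(c(0, 0, 1, 1), c(2, 2, 1, 1))  # labels flipped, still 1
```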

Hierarchical

# Dissimilarity matrix
d <- dist(diabetes_matrix, method = "euclidean")   # can try different distance metrics

# Hierarchical clustering using Complete Linkage
hc_agg <- agnes(d, method = "complete" )   # can try different methods here
hc_agg
## Call:     agnes(x = d, method = "complete") 
## Agglomerative coefficient:  0.9041579 
## Order of objects:
##   [1]   1 755 702 161 757 213 370 549 441 507 550 703  57 546 663 133 426 196
##  [19] 245 425 470 156 210 762  15 286 224 307 499  54 207 237 604 671 389  41
##  [37] 400 228 108 569 723 131 261 516 697 313 647 428 749 612  70 190 111 123
##  [55] 522 412 727 701 542 686 529 298 737 136 570 443 548 326 634 385 420 528
##  [73] 705 731 766 167 414 728 398 472 482 603  23 340 155  27 179 208 599  85
##  [91] 304 231 236 392 379  73 659 692  24 479 668 195 311  34 181 617 732 334
## [109] 635 763  65 220 584 144 247 251  77 177 506 560 275 636 519  31 720 363
## [127] 353 492  93 388 346 724  35  38 359 511 746 134 463 738 147 404 673  87
## [145] 559 713 299 615 741 271 591 315 543  22 345 675 279 300 510 558 513 255
## [163] 520  29 583  43 461 718 460 764  25 649 670 324  89 437 456 115 176 339
## [181] 500 186 239  26 282 215 283  96 556 162 568 205 669 192 694 664 541 595
## [199]  72 218 172 296 217 290 653 403 387 712 418 411 614 639  83 504 166 189
## [217] 478  44  55 613 216 376 459 517 153 260 745 160   3  45 193 405 318 676
## [235]  12 328  37 579  62 587 284 643 474 356 709 661 677 402 750 320 362 496
## [253] 490 760  68 116 222 130  94 141 285 735 537 725 758 767 117 629 593 518
## [271] 561 338 407 637 124 553 149 364 264 667 476 480  13 331 619 246 409  42
## [289] 180 440 524 744 465 132 444 631 691 168 395 751  10 685   8 223 469  16
## [307] 534 644 270 536 620 348 431 602 590 698 173 598  79 436 485 704 333 605
## [325] 262 267 301 337 194 358   2 175 211 483 235 521  66 464 721 104 159 253
## [343] 369 120 150 630  80 753  90 258 551  84 650 323 250 450 377 501 601  48
## [361] 233 433 498 638  75 768 545 442 554 242 458 552 263  88 567 473 765  97
## [379] 241 113 226 322 325 209 655 332 574 110 291 484 544  92 699 170 641 342
## [397] 432 129 430 391  49 122 330 493 665  78 502 109 736 142 505 726 303 706
## [415] 626 128 374 576 256 606 257 254 557 448 632 274 312 710 378 563 449 468
## [433] 423 373 491 555  20 327 182 292  71 390  86 381 645 310 357 375  36 397
## [451] 577 157 566 158 509 761 119 164 424 143 573 137 138 278 316 652 455 739
## [469] 611 734 742 314 531 347  67 201 203 266 494 435   4 335 651 743 272 657
## [487]  33 515 289 349  81 386 688 600  98 681   7  99 366  53 489 206 277 582
## [505] 445 564  64 106 508 689  95 280 383 321 680 308 512 610 341 466   6 497
## [523] 588 367  18 171 265 401 616 715 169 684 234 642 252 434 352 438 152 394
## [541] 185 273 305  47 452 102 627 243 687 633 654 227 759 678 475 525 625 679
## [559]  28 451  51 135 198 225 447 354 422 240 419 530  69 527 608 526 618 368
## [577] 382 439 204 640 317  52 586 672  56 467  91 408 399 462 514 191 695 197
## [595] 269 730  63 114 184 118  11  30 344 351 571 740 125 628 202 281 139 165
## [613] 565 105 700 355 532 578 597 103 572 729 107  76 183 343 350 503  50 523
## [631] 707  61  82 495 427 295 457 454 538 146 372 148 244 309 396 219 660 384
## [649] 417 535 594 622   5 446  40 188 293 658 540 756 488 547 589  46 662  59
## [667] 101 623 229 371   9  14 154 487 410 287 416 646 754 656 248  32 717 607
## [685] 413 575 360 562 596 221 259 361 716  74 140 145 200 365 481 297 393 714
## [703] 708 711 112 187 585 232 249 696  17 421 288  21 539 477  60 722 453 719
## [721] 306 752 592 666 621 319 127 214 151 486 230 329 624 683 100 429 163 336
## [739] 609 690 294 406 302 415 648 121 212 581 471 682 747 238 733  39 174 276
## [757] 533 199 693 268  58 380 748  19 126 178 674 580
## Height (summary):
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3453  0.9838  1.4190  1.8163  2.1925 12.2052 
## 
## Available components:
## [1] "order"  "height" "ac"     "merge"  "diss"   "call"   "method"
# After trying all methods, ward gives the best fit (highest agglomerative coefficient)
hc_ward <- agnes(d, method = "ward" ) 
hc_ward
## Call:     agnes(x = d, method = "ward") 
## Agglomerative coefficient:  0.9712135 
## Order of objects:
##   [1]   1 755 702 161 757 387 712 418 186 210 762 213 499 370 549 441 507 550
##  [19] 703 580  24 479 668  26 282 215  96 556 307 649 670  25  89 437 456  87
##  [37] 559 713  72 218 217 172 296 115 176 339 500 156 425 470 192 694 664 541
##  [55] 595 160 216 376 459 517 299 615 741 271 591  31 720 363  93 388 346 724
##  [73] 162 568 205 669 764  35  38 543 359 511 746 134 463 738 147 404 673  66
##  [91] 464 721 353 492 142 505 726 303 706 626 290 653 403 330 493 665   5 446
## [109]  46 662 229 371  13 331 619 153 260 745 239 324 246 409  59 623 101 751
## [127]  42 180 440 524 744 132 444 631 691 465  83 504 166 189 478 411 614 639
## [145] 266 494 315 146 372 148 535 219 660 594 622 384 417 435 309 396   9  14
## [163] 154 487 410 287 416 646 754 656 248  32 717 413 575 607 360 562 596 221
## [181] 259  41 400 111 228 648 108 569 723 428 749 612 131 261 516 697 313 647
## [199]  74 140 145 200 297 365 481 133 426 196 245 393 714 708 711  15 286 224
## [217] 389 237 604 671 255 520 488 547 589  44  54 207  57 546 663  55 613 361
## [235] 716 112 187 585 232 249 696   3  45 193 405 318 676  12 328  23 340 155
## [253] 124 553 149 362 496 402 750 208 760 320 490 284 643 474 356 709 661 677
## [271]  27 179 518 561  65 220 584 117 629 593 338 407 637 537 725 758 767 571
## [289] 740 597  68 116 222 364 264 667 476 480  94 141 285 735 295 457 130 538
## [307]  10 685  22 345 675 279 300 510 558 513  29 583 460  43 461 718  37 579
## [325]  62 587 144 247 251  77 177 506 560 635 763 275 636 519 334  73 659 692
## [343]   2 175 211 483   4 104 159 235 521  80 753  90 258 551  84 650  97 241
## [361] 150 630 272 657  33 515 289 349 335 651  81 386 688 600 611 734 742  56
## [379] 467  98 681 526 618   7  99 366 242 458 552 263  49 122  78 502 109 736
## [397] 119 164 424 143 573 256 606 257  53 489 206 277 582 445 564 195 311  63
## [415] 114 184 118  76 183 343 350 503  28 451  69 527 608 368 382 439  51 377
## [433] 120 253 369 204 640 317 135 225 447 198 354 422 240 419 530  52 586 672
## [451] 514 633 654  91 408 399 462 191 695 197 269 102 627 243 687  20 327 310
## [469] 182 292  71 390  86 381 645  64 106 508 302 415 529  95 689  21 539 477
## [487] 298 737 412 727 701 542 686  36 397 577 244 157 566 158 314 531 347 509
## [505] 761 280 383 680 321  48 233 433 498 638 137 138 278 442 554  67 201 203
## [523] 170 641 342 432  70 190 385 420 528 705 731 766  92 699 283 167 414 450
## [541] 728 323 250 501 601 308 512 326 634 341 466 610 743 123 522 209 655 316
## [559] 652 332 574 136 570 443 548 129 430 391   6 497 588 367  18 171 265 401
## [577] 616 715  34 181 617 732 152 394 185 273 305 103 572 729 107  11  30 344
## [595] 351 169 234 642 684 252 434 352 438  47 452 168 395 139 165 565 125 628
## [613] 227 759 475 525 625 679 678 730  85 304 202 281 599 231 236 392 379 105
## [631] 700 355 532 578  17 421 288 100 690 151 486 429 163 336 609 178 674 121
## [649] 238 733 212 581 471 682 747  40 188 540 756 293 658  58 380 748  19 126
## [667] 174 423 276 533 199 693 268 294 406 357 375 312 710 378 563 449 468 373
## [685] 491 555 455 739  39 254 557  88 567 473 765 128 374 576 322 325 482 603
## [703] 398 472  75 768 545 113 226 110 291 484 544 274 448 632  60 722 453 719
## [721] 624 683 127 214 230 329 306 752 319 592 666 621   8 223 469  16 534 644
## [739] 173 598 348 431 602 590 698 454  79 436 485 704 333 270 536 620 605 262
## [757] 267 301 337 194 358  50 523 707  61  82 495 427
## Height (summary):
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3453  0.9918  1.4537  2.3417  2.4063 40.4477 
## 
## Available components:
## [1] "order"  "height" "ac"     "merge"  "diss"   "call"   "method"
# Plot the obtained dendrogram
pltree(hc_ward, cex = 0.6, hang = -1, main="Ward Linkage")

# Cut tree into 2 groups
hc_sub_grp <- cutree(hc_ward, k = 2)  # gives us a vector of the cluster assigned to each observation
# Plot our clusters
fviz_cluster(list(data = diabetes_matrix, cluster = hc_sub_grp))

# Compare our results from hc_sub_grp with the known values
hcward_results.compare<-data.frame(cbind(as.character(diabetes_data$Outcome), hc_sub_grp))
table(hcward_results.compare)
##    hc_sub_grp
## V1    1   2
##   0 167 333
##   1 175  93
hc_ward_accuracy <- (333+175)/768
hc_ward_accuracy
## [1] 0.6614583
# Conduct divisive (top-down) clustering
hc_divisive <- diana(diabetes_matrix)
hc_divisive$dc                     # get the divisive coefficient - equivalent to ac
## [1] 0.8905177
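The same agglomerate-then-cut workflow is also available in base R via hclust(); agnes's "ward" roughly corresponds to hclust's "ward.D2" on Euclidean distances. A self-contained illustration on synthetic data (not the diabetes data):

```r
# Base-R hierarchical clustering on two well-separated synthetic blobs.
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
           matrix(rnorm(40, mean = 10), ncol = 2))
hc <- hclust(dist(x), method = "ward.D2")
groups <- cutree(hc, k = 2)
table(groups)  # the cut recovers the two 20-point blobs
```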

DBSCAN

##### Calculate the epsilon neighborhood, i.e., find a good eps
kNNdistplot(diabetes_matrix, k=5)   # k matches minPts used below
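kNNdistplot works by sorting each point's distance to its k-th nearest neighbor; eps is read off at the "knee" of that curve. The underlying computation in base R on toy data (a sketch of the heuristic, not dbscan's code):

```r
# k-th nearest-neighbor distances; the knee of the sorted curve
# suggests an eps value.
set.seed(7)
x <- matrix(rnorm(100), ncol = 2)   # 50 toy points
d <- as.matrix(dist(x))
k <- 5
kth_dist <- apply(d, 1, function(row) sort(row)[k + 1])  # [1] is self (0)
plot(sort(kth_dist), type = "l", ylab = "5-NN distance", xlab = "points sorted")
```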

# Implement DBSCAN
db = dbscan(diabetes_matrix, eps = 2, minPts = 5, borderPoints = FALSE)
# Visualize the results
fviz_cluster(list(data = diabetes_matrix, cluster = db$cluster))

# Hullplot from dbscan package isn't confused by unassigned cases
hullplot(diabetes_matrix, db$cluster)

# Compare our results from db$cluster with the known values
dbscan_results.compare<-data.frame(cbind(as.character(diabetes_data$Outcome), db$cluster))
table(dbscan_results.compare)
##    X2
## X1    0   1   2   3
##   0  35 449  10   6
##   1  48 209  11   0
dbscan_accuracy <- (449+48)/768
dbscan_accuracy
## [1] 0.6471354

Question 1

Which clustering method clusters the data best with respect to the different Outcome types? Justify with plots and accuracy rates.

Discuss each clustering method on this data set. Interpret the results and the pros and cons of each method for these data.

K Means cluster
fviz_cluster(list(data = diabetes_matrix, cluster = km.out$cluster))

K-means accuracy is about 67%

Hierarchical cluster
fviz_cluster(list(data = diabetes_matrix, cluster = hc_sub_grp))

Hierarchical accuracy is about 66%

DBSCAN cluster
fviz_cluster(list(data = diabetes_matrix, cluster = db$cluster))

DBSCAN accuracy is about 64%

From this, we determine that k-means clustering works best for this diabetes dataset. Since we know the response is binary, we can leverage that knowledge to force k-means and hierarchical clustering into two clusters. We don't have that option with DBSCAN, which results in a somewhat lower overall accuracy.

Classification

diabetes_data$cluster = km.out$cluster

diabetes_data$Outcome = as.factor(diabetes_data$Outcome)
diabetes_data$cluster = as.factor(diabetes_data$cluster)
diabetes_data[,c(1:8)] = scale(diabetes_data[,c(1:8)])

training.samples <- diabetes_data$Outcome %>% 
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- diabetes_data[training.samples, ]
test.data <- diabetes_data[-training.samples, ]

Logit

Logit <- glm(formula = Outcome ~ .,
               family  = binomial(link = "logit"),
               data    = train.data)
summary(Logit)
## 
## Call:
## glm(formula = Outcome ~ ., family = binomial(link = "logit"), 
##     data = train.data)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.710017   0.288393  -2.462  0.01382 *  
## Pregnancies       0.283801   0.143911   1.972  0.04860 *  
## Glucose           1.087337   0.132598   8.200 2.40e-16 ***
## BloodPressure    -0.296904   0.119728  -2.480  0.01314 *  
## SkinThickness    -0.003014   0.121949  -0.025  0.98028    
## Insulin          -0.193470   0.116640  -1.659  0.09718 .  
## BMI               0.827599   0.140617   5.885 3.97e-09 ***
## PedigreeFunction  0.349440   0.113738   3.072  0.00212 ** 
## Age               0.148225   0.162775   0.911  0.36250    
## cluster2         -0.268043   0.411298  -0.652  0.51459    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 796.05  on 614  degrees of freedom
## Residual deviance: 579.60  on 605  degrees of freedom
## AIC: 599.6
## 
## Number of Fisher Scoring iterations: 5
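Since the predictors were standardized, each coefficient is a per-standard-deviation effect on the log-odds; exponentiating converts it to an odds ratio. Using the Glucose estimate from the summary above as an illustration:

```r
# Odds ratio for a 1-SD increase in Glucose (coefficient taken from
# the model summary above).
exp(1.087337)  # roughly 2.97x the odds of diabetes per SD of glucose
```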
Logit.step<-stats::step(Logit, direction = "both")
## Start:  AIC=599.6
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + 
##     Insulin + BMI + PedigreeFunction + Age + cluster
## 
##                    Df Deviance    AIC
## - SkinThickness     1   579.60 597.60
## - cluster           1   580.03 598.03
## - Age               1   580.43 598.43
## <none>                  579.60 599.60
## - Insulin           1   582.33 600.33
## - Pregnancies       1   583.53 601.53
## - BloodPressure     1   585.84 603.84
## - PedigreeFunction  1   589.54 607.54
## - BMI               1   620.21 638.21
## - Glucose           1   663.79 681.79
## 
## Step:  AIC=597.6
## Outcome ~ Pregnancies + Glucose + BloodPressure + Insulin + BMI + 
##     PedigreeFunction + Age + cluster
## 
##                    Df Deviance    AIC
## - cluster           1   580.03 596.03
## - Age               1   580.44 596.44
## <none>                  579.60 597.60
## - Insulin           1   582.94 598.94
## - Pregnancies       1   583.54 599.54
## + SkinThickness     1   579.60 599.60
## - BloodPressure     1   586.03 602.03
## - PedigreeFunction  1   589.56 605.56
## - BMI               1   624.23 640.23
## - Glucose           1   665.10 681.10
## 
## Step:  AIC=596.03
## Outcome ~ Pregnancies + Glucose + BloodPressure + Insulin + BMI + 
##     PedigreeFunction + Age
## 
##                    Df Deviance    AIC
## <none>                  580.03 596.03
## - Age               1   583.01 597.01
## - Insulin           1   583.47 597.47
## + cluster           1   579.60 597.60
## + SkinThickness     1   580.03 598.03
## - BloodPressure     1   586.06 600.06
## - Pregnancies       1   588.09 602.09
## - PedigreeFunction  1   590.18 604.18
## - BMI               1   624.88 638.88
## - Glucose           1   673.70 687.70
summary(Logit.step)
## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure + 
##     Insulin + BMI + PedigreeFunction + Age, family = binomial(link = "logit"), 
##     data = train.data)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -0.8845     0.1088  -8.132 4.21e-16 ***
## Pregnancies        0.3359     0.1196   2.810  0.00496 ** 
## Glucose            1.1047     0.1291   8.556  < 2e-16 ***
## BloodPressure     -0.2824     0.1158  -2.439  0.01472 *  
## Insulin           -0.1970     0.1058  -1.863  0.06250 .  
## BMI                0.8282     0.1347   6.149 7.81e-10 ***
## PedigreeFunction   0.3519     0.1134   3.103  0.00191 ** 
## Age                0.2164     0.1251   1.729  0.08380 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 796.05  on 614  degrees of freedom
## Residual deviance: 580.03  on 607  degrees of freedom
## AIC: 596.03
## 
## Number of Fisher Scoring iterations: 5
# Predicted probabilities and classes (at a 0.45 threshold) on the test set
predProbLogit <- predict(Logit.step, type = "response", newdata = test.data)
predClassLogit <- factor(predProbLogit > 0.45, levels = c(F,T), labels = c("0","1"))

# Generate a confusion matrix and performance statistics on test dataset
confusionMatrix(data=predClassLogit, reference=test.data$Outcome)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 87 17
##          1 13 36
##                                           
##                Accuracy : 0.8039          
##                  95% CI : (0.7321, 0.8636)
##     No Information Rate : 0.6536          
##     P-Value [Acc > NIR] : 3.3e-05         
##                                           
##                   Kappa : 0.5592          
##                                           
##  Mcnemar's Test P-Value : 0.5839          
##                                           
##             Sensitivity : 0.8700          
##             Specificity : 0.6792          
##          Pos Pred Value : 0.8365          
##          Neg Pred Value : 0.7347          
##              Prevalence : 0.6536          
##          Detection Rate : 0.5686          
##    Detection Prevalence : 0.6797          
##       Balanced Accuracy : 0.7746          
##                                           
##        'Positive' Class : 0               
## 
# Calculate AUC - use @y.values to get AUC out
Logit.AUC <- performance(prediction(predProbLogit, test.data$Outcome), measure="auc")@y.values
Logit.AUC
## [[1]]
## [1] 0.8415094
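The 0.45 cutoff above is a judgment call rather than a tuned value. A sketch of scanning candidate cutoffs, assuming predProbLogit and test.data from above (note that tuning the cutoff on the test set risks an optimistic accuracy estimate; a separate validation split would be cleaner):

# Hedged sketch: test-set accuracy over a grid of classification cutoffs
cutoffs <- seq(0.30, 0.70, by = 0.05)
acc <- sapply(cutoffs, function(t) mean((predProbLogit > t) == (test.data$Outcome == "1")))
data.frame(cutoff = cutoffs, accuracy = acc)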

SVM

# Set a seed value so we get the same answer each time we fit
set.seed(123)

# Balance the classes using weights
## SVM performs poorly when the data set is not balanced.
wts <- 100/table(diabetes_data$Outcome)

# Fit an SVM with a linear kernel, Outcome as the target, and all remaining
# variables as predictors (the argument is class.weights; gamma is ignored
# by the linear kernel)
SVM <- svm(Outcome ~ ., data=train.data, kernel="linear",
           cost=1, gamma=1, class.weights=wts,
           probability=TRUE, decision.values=TRUE)
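The cost value of 1 above is not tuned. A sketch of a small grid search with e1071's tune(), assuming train.data and wts exist:

# Hedged sketch: grid-search cost for the linear kernel
set.seed(123)
tuned <- tune(svm, Outcome ~ ., data = train.data, kernel = "linear",
              ranges = list(cost = c(0.1, 1, 10)),
              class.weights = wts)
tuned$best.parameters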

# Get the probabilities predicted by the SVM
predProbSVM.raw <- predict(SVM, test.data, probability = TRUE)

# Extract the class probabilities from the attributes; check
# colnames(attributes(predProbSVM.raw)$probabilities) to confirm which
# class column 1 refers to
predProbSVM <- attributes(predProbSVM.raw)$probabilities[,1]

# Get the probability classes
predClassSVM <- predict(SVM, newdata = test.data)

confusionMatrix(predClassSVM, test.data$Outcome)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 81 13
##          1 19 40
##                                           
##                Accuracy : 0.7908          
##                  95% CI : (0.7178, 0.8523)
##     No Information Rate : 0.6536          
##     P-Value [Acc > NIR] : 0.0001499       
##                                           
##                   Kappa : 0.5501          
##                                           
##  Mcnemar's Test P-Value : 0.3767591       
##                                           
##             Sensitivity : 0.8100          
##             Specificity : 0.7547          
##          Pos Pred Value : 0.8617          
##          Neg Pred Value : 0.6780          
##              Prevalence : 0.6536          
##          Detection Rate : 0.5294          
##    Detection Prevalence : 0.6144          
##       Balanced Accuracy : 0.7824          
##                                           
##        'Positive' Class : 0               
## 
# Calculate ROC curves for the stepwise logit model and the SVM, plus the SVM AUC
Logit.ROC <- performance(prediction(predProbLogit, test.data$Outcome), measure="tpr", x.measure="fpr")

SVM.ROC <- performance(prediction(predProbSVM, test.data$Outcome), measure="tpr", x.measure="fpr")

SVM.AUC <- performance(prediction(predProbSVM, test.data$Outcome), measure="auc")@y.values

KNN

library(class) # new library for KNN modeling
library(rpart) # new library for CART modeling

# New package for making pretty pictures of CART trees
library(rpart.plot)

# Assigning the response 
train.def <- train.data$Outcome
test.def <- test.data$Outcome

# Assign explanatory variables
train.gc <- train.data[,c(1:8,10)]
test.gc <- test.data[,c(1:8,10)]

knn.1 <-  knn(train.gc, test.gc, train.def, k=1)
knn.5 <-  knn(train.gc, test.gc, train.def, k=5)
knn.20 <- knn(train.gc, test.gc, train.def, k=20)

sum(test.def == knn.1)/length(test.def)  # For knn = 1
## [1] 0.7320261
sum(test.def == knn.5)/length(test.def) # For knn = 5
## [1] 0.7320261
sum(test.def == knn.20)/length(test.def)  # For knn = 20
## [1] 0.7712418
knn <-  knn(train.gc, test.gc, train.def, k=28)
sum(test.def == knn)/length(test.def)  # For k = 28, approximately sqrt(n)
## [1] 0.8039216
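Rather than spot-checking a few values of k, a short loop can scan a range. A sketch, assuming train.gc, test.gc, train.def, and test.def from above:

# Hedged sketch: test-set accuracy across a range of k values
ks <- seq(1, 40, by = 3)
acc <- sapply(ks, function(k) mean(knn(train.gc, test.gc, train.def, k = k) == test.def))
ks[which.max(acc)]  # the k with the best test accuracy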
# Confusion Matrix 
confusionMatrix(knn, test.data$Outcome)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 92 22
##          1  8 31
##                                           
##                Accuracy : 0.8039          
##                  95% CI : (0.7321, 0.8636)
##     No Information Rate : 0.6536          
##     P-Value [Acc > NIR] : 3.3e-05         
##                                           
##                   Kappa : 0.5383          
##                                           
##  Mcnemar's Test P-Value : 0.01762         
##                                           
##             Sensitivity : 0.9200          
##             Specificity : 0.5849          
##          Pos Pred Value : 0.8070          
##          Neg Pred Value : 0.7949          
##              Prevalence : 0.6536          
##          Detection Rate : 0.6013          
##    Detection Prevalence : 0.7451          
##       Balanced Accuracy : 0.7525          
##                                           
##        'Positive' Class : 0               
## 
knn.prob <-  knn(train.gc, test.gc, train.def, k=28, prob=TRUE)

# knn() returns the vote proportion for the *winning* class; convert it to
# the probability of class "1" for every observation
KNN.prob <- rep(0, length(knn.prob)) # Initialize

for(i in 1:length(KNN.prob)){
  KNN.prob[i] <- ifelse(knn.prob[i] != "0", attr(knn.prob, 'prob')[i], 1 - attr(knn.prob, 'prob')[i])
}
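The same conversion can be written without the loop; a vectorized equivalent under the same assumptions:

# Vectorized equivalent of the loop above
win.prob <- attr(knn.prob, "prob")
KNN.prob <- ifelse(knn.prob != "0", win.prob, 1 - win.prob)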
predProbKNN <- prediction(KNN.prob, test.data$Outcome)

KNN.ROC <- performance(predProbKNN, measure="tpr", x.measure="fpr")

KNN.AUC <- performance(prediction(KNN.prob, test.data$Outcome), 
                       measure="auc")@y.values

CART

CART <- rpart(Outcome ~., data=train.data)
summary(CART)
## Call:
## rpart(formula = Outcome ~ ., data = train.data)
##   n= 615 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.25581395      0 1.0000000 1.0000000 0.05500133
## 2 0.10232558      1 0.7441860 0.7906977 0.05158560
## 3 0.01395349      2 0.6418605 0.6883721 0.04930753
## 4 0.01085271     10 0.5209302 0.7395349 0.05050158
## 5 0.01000000     16 0.4511628 0.7255814 0.05018716
## 
## Variable importance
##          Glucose              BMI              Age          Insulin 
##               36               17               10               10 
##    BloodPressure          cluster PedigreeFunction      Pregnancies 
##                8                7                6                4 
##    SkinThickness 
##                2 
## 
## Node number 1: 615 observations,    complexity param=0.255814
##   predicted class=0  expected loss=0.3495935  P(node) =1
##     class counts:   400   215
##    probabilities: 0.650 0.350 
##   left son=2 (388 obs) right son=3 (227 obs)
##   Primary splits:
##       Glucose < 0.2065977    to the left,  improve=53.06460, (0 missing)
##       BMI     < -0.3985939   to the left,  improve=25.82002, (0 missing)
##       Age     < -0.4031286   to the left,  improve=21.71025, (0 missing)
##       cluster splits as  RL, improve=20.29458, (0 missing)
##       Insulin < 0.5484062    to the left,  improve=11.70732, (0 missing)
##   Surrogate splits:
##       Insulin       < 0.3575069    to the left,  agree=0.699, adj=0.185, (0 split)
##       BloodPressure < 0.61452      to the left,  agree=0.667, adj=0.097, (0 split)
##       Age           < 1.297518     to the left,  agree=0.663, adj=0.088, (0 split)
##       cluster       splits as  RL, agree=0.663, adj=0.088, (0 split)
##       BMI           < 0.9839249    to the left,  agree=0.657, adj=0.070, (0 split)
## 
## Node number 2: 388 observations,    complexity param=0.01395349
##   predicted class=0  expected loss=0.1907216  P(node) =0.6308943
##     class counts:   314    74
##    probabilities: 0.809 0.191 
##   left son=4 (102 obs) right son=5 (286 obs)
##   Primary splits:
##       BMI         < -0.7030017   to the left,  improve=10.066900, (0 missing)
##       Age         < -0.4031286   to the left,  improve= 9.205109, (0 missing)
##       cluster     splits as  RL, improve= 7.046622, (0 missing)
##       Pregnancies < -0.1024022   to the left,  improve= 5.806035, (0 missing)
##       Glucose     < -0.6065982   to the left,  improve= 5.159331, (0 missing)
##   Surrogate splits:
##       BloodPressure    < -2.950302    to the left,  agree=0.745, adj=0.029, (0 split)
##       PedigreeFunction < 2.792164     to the right, agree=0.745, adj=0.029, (0 split)
## 
## Node number 3: 227 observations,    complexity param=0.1023256
##   predicted class=1  expected loss=0.3788546  P(node) =0.3691057
##     class counts:    86   141
##    probabilities: 0.379 0.621 
##   left son=6 (64 obs) right son=7 (163 obs)
##   Primary splits:
##       BMI              < -0.2590736   to the left,  improve=15.305370, (0 missing)
##       Glucose          < 1.426391     to the left,  improve=11.063870, (0 missing)
##       PedigreeFunction < -0.462913    to the left,  improve= 6.218467, (0 missing)
##       Age              < -0.7432579   to the left,  improve= 4.007314, (0 missing)
##       BloodPressure    < -1.348715    to the right, improve= 3.002903, (0 missing)
##   Surrogate splits:
##       Age              < 1.807712     to the right, agree=0.740, adj=0.078, (0 split)
##       PedigreeFunction < -1.045416    to the left,  agree=0.727, adj=0.031, (0 split)
## 
## Node number 4: 102 observations
##   predicted class=0  expected loss=0  P(node) =0.1658537
##     class counts:   102     0
##    probabilities: 1.000 0.000 
## 
## Node number 5: 286 observations,    complexity param=0.01395349
##   predicted class=0  expected loss=0.2587413  P(node) =0.4650407
##     class counts:   212    74
##    probabilities: 0.741 0.259 
##   left son=10 (144 obs) right son=11 (142 obs)
##   Primary splits:
##       Age              < -0.4031286   to the left,  improve=8.332272, (0 missing)
##       cluster          splits as  RL, improve=6.792232, (0 missing)
##       PedigreeFunction < 1.928972     to the left,  improve=6.251258, (0 missing)
##       Glucose          < -0.6691517   to the left,  improve=5.980590, (0 missing)
##       Pregnancies      < 0.1943709    to the left,  improve=5.298337, (0 missing)
##   Surrogate splits:
##       Pregnancies   < -0.1024022   to the left,  agree=0.787, adj=0.570, (0 split)
##       cluster       splits as  RL, agree=0.780, adj=0.556, (0 split)
##       Insulin       < -0.6230214   to the right, agree=0.640, adj=0.275, (0 split)
##       BloodPressure < 0.09787922   to the left,  agree=0.633, adj=0.261, (0 split)
##       SkinThickness < -0.9739372   to the right, agree=0.608, adj=0.211, (0 split)
## 
## Node number 6: 64 observations,    complexity param=0.01395349
##   predicted class=0  expected loss=0.328125  P(node) =0.104065
##     class counts:    43    21
##    probabilities: 0.672 0.328 
##   left son=12 (47 obs) right son=13 (17 obs)
##   Primary splits:
##       Glucose       < 1.223092     to the left,  improve=3.132392, (0 missing)
##       Age           < -0.4881609   to the left,  improve=2.684247, (0 missing)
##       Pregnancies   < -0.3991752   to the left,  improve=1.707615, (0 missing)
##       BMI           < -1.115221    to the left,  improve=1.692434, (0 missing)
##       BloodPressure < 0.61452      to the right, improve=1.064808, (0 missing)
##   Surrogate splits:
##       Age < 2.615519     to the left,  agree=0.75, adj=0.059, (0 split)
## 
## Node number 7: 163 observations,    complexity param=0.01085271
##   predicted class=1  expected loss=0.2638037  P(node) =0.2650407
##     class counts:    43   120
##    probabilities: 0.264 0.736 
##   left son=14 (110 obs) right son=15 (53 obs)
##   Primary splits:
##       Glucose          < 1.395115     to the left,  improve=5.571202, (0 missing)
##       PedigreeFunction < -0.4930945   to the left,  improve=4.423668, (0 missing)
##       BloodPressure    < -0.4187616   to the right, improve=3.173023, (0 missing)
##       BMI              < -0.1449207   to the right, improve=1.966217, (0 missing)
##       Age              < -0.2330639   to the left,  improve=1.636171, (0 missing)
##   Surrogate splits:
##       PedigreeFunction < 2.778582     to the left,  agree=0.693, adj=0.057, (0 split)
##       SkinThickness    < 2.442516     to the left,  agree=0.687, adj=0.038, (0 split)
##       Insulin          < 3.234012     to the left,  agree=0.687, adj=0.038, (0 split)
##       BMI              < -0.2273645   to the right, agree=0.687, adj=0.038, (0 split)
##       Age              < 1.467583     to the left,  agree=0.687, adj=0.038, (0 split)
## 
## Node number 10: 144 observations
##   predicted class=0  expected loss=0.1388889  P(node) =0.2341463
##     class counts:   124    20
##    probabilities: 0.861 0.139 
## 
## Node number 11: 142 observations,    complexity param=0.01395349
##   predicted class=0  expected loss=0.3802817  P(node) =0.2308943
##     class counts:    88    54
##    probabilities: 0.620 0.380 
##   left son=22 (53 obs) right son=23 (89 obs)
##   Primary splits:
##       Glucose          < -0.6378749   to the left,  improve=6.208780, (0 missing)
##       PedigreeFunction < 0.4621506    to the left,  improve=5.564858, (0 missing)
##       Insulin          < 0.5440675    to the left,  improve=2.770399, (0 missing)
##       BloodPressure    < 0.9245045    to the right, improve=1.881190, (0 missing)
##       BMI              < -0.1068697   to the right, improve=1.418053, (0 missing)
##   Surrogate splits:
##       BloodPressure    < -2.795309    to the left,  agree=0.634, adj=0.019, (0 split)
##       BMI              < 1.83373      to the right, agree=0.634, adj=0.019, (0 split)
##       PedigreeFunction < -1.128415    to the left,  agree=0.634, adj=0.019, (0 split)
## 
## Node number 12: 47 observations
##   predicted class=0  expected loss=0.2340426  P(node) =0.07642276
##     class counts:    36    11
##    probabilities: 0.766 0.234 
## 
## Node number 13: 17 observations
##   predicted class=1  expected loss=0.4117647  P(node) =0.02764228
##     class counts:     7    10
##    probabilities: 0.412 0.588 
## 
## Node number 14: 110 observations,    complexity param=0.01085271
##   predicted class=1  expected loss=0.3545455  P(node) =0.1788618
##     class counts:    39    71
##    probabilities: 0.355 0.645 
##   left son=28 (95 obs) right son=29 (15 obs)
##   Primary splits:
##       BloodPressure    < -0.4187616   to the right, improve=4.366507, (0 missing)
##       Insulin          < 1.498564     to the right, improve=3.525327, (0 missing)
##       Age              < -0.2330639   to the left,  improve=2.257921, (0 missing)
##       BMI              < -0.1449207   to the right, improve=2.168984, (0 missing)
##       PedigreeFunction < 0.7790565    to the left,  improve=1.897678, (0 missing)
## 
## Node number 15: 53 observations
##   predicted class=1  expected loss=0.0754717  P(node) =0.08617886
##     class counts:     4    49
##    probabilities: 0.075 0.925 
## 
## Node number 22: 53 observations
##   predicted class=0  expected loss=0.1886792  P(node) =0.08617886
##     class counts:    43    10
##    probabilities: 0.811 0.189 
## 
## Node number 23: 89 observations,    complexity param=0.01395349
##   predicted class=0  expected loss=0.494382  P(node) =0.1447154
##     class counts:    45    44
##    probabilities: 0.506 0.494 
##   left son=46 (17 obs) right son=47 (72 obs)
##   Primary splits:
##       PedigreeFunction < -0.8069823   to the left,  improve=5.964970, (0 missing)
##       BloodPressure    < 0.9245045    to the right, improve=2.115840, (0 missing)
##       Age              < 1.680164     to the right, improve=1.877657, (0 missing)
##       BMI              < 0.2229054    to the right, improve=1.316879, (0 missing)
##       SkinThickness    < 1.188772     to the right, improve=1.233310, (0 missing)
##   Surrogate splits:
##       SkinThickness < 1.690269     to the right, agree=0.831, adj=0.118, (0 split)
##       Glucose       < -0.5753214   to the left,  agree=0.820, adj=0.059, (0 split)
##       BloodPressure < -1.503707    to the left,  agree=0.820, adj=0.059, (0 split)
## 
## Node number 28: 95 observations,    complexity param=0.01085271
##   predicted class=1  expected loss=0.4105263  P(node) =0.1544715
##     class counts:    39    56
##    probabilities: 0.411 0.589 
##   left son=56 (41 obs) right son=57 (54 obs)
##   Primary splits:
##       Age              < -0.2330639   to the left,  improve=4.409842, (0 missing)
##       Insulin          < 1.498564     to the right, improve=3.612432, (0 missing)
##       Pregnancies      < 0.787917     to the left,  improve=2.388561, (0 missing)
##       PedigreeFunction < 0.7790565    to the left,  improve=2.245614, (0 missing)
##       cluster          splits as  RL, improve=2.063618, (0 missing)
##   Surrogate splits:
##       cluster       splits as  RL, agree=0.926, adj=0.829, (0 split)
##       Pregnancies   < -0.1024022   to the left,  agree=0.758, adj=0.439, (0 split)
##       BloodPressure < 0.4078637    to the left,  agree=0.642, adj=0.171, (0 split)
##       Insulin       < 0.9215275    to the right, agree=0.642, adj=0.171, (0 split)
##       BMI           < 0.9205066    to the right, agree=0.642, adj=0.171, (0 split)
## 
## Node number 29: 15 observations
##   predicted class=1  expected loss=0  P(node) =0.02439024
##     class counts:     0    15
##    probabilities: 0.000 1.000 
## 
## Node number 46: 17 observations
##   predicted class=0  expected loss=0.1176471  P(node) =0.02764228
##     class counts:    15     2
##    probabilities: 0.882 0.118 
## 
## Node number 47: 72 observations,    complexity param=0.01395349
##   predicted class=1  expected loss=0.4166667  P(node) =0.1170732
##     class counts:    30    42
##    probabilities: 0.417 0.583 
##   left son=94 (48 obs) right son=95 (24 obs)
##   Primary splits:
##       PedigreeFunction < 0.2689889    to the left,  improve=3.125000, (0 missing)
##       Glucose          < -0.4814911   to the right, improve=2.329032, (0 missing)
##       BloodPressure    < 0.9245045    to the right, improve=1.864516, (0 missing)
##       Age              < 1.467583     to the right, improve=1.864516, (0 missing)
##       BMI              < 1.370777     to the left,  improve=1.162637, (0 missing)
##   Surrogate splits:
##       Insulin       < 0.5874537    to the left,  agree=0.708, adj=0.125, (0 split)
##       BMI           < 1.142471     to the left,  agree=0.708, adj=0.125, (0 split)
##       BloodPressure < -0.9354024   to the right, agree=0.694, adj=0.083, (0 split)
##       Pregnancies   < -0.9927214   to the right, agree=0.681, adj=0.042, (0 split)
##       Glucose       < -0.5753214   to the right, agree=0.681, adj=0.042, (0 split)
## 
## Node number 56: 41 observations,    complexity param=0.01085271
##   predicted class=0  expected loss=0.4146341  P(node) =0.06666667
##     class counts:    24    17
##    probabilities: 0.585 0.415 
##   left son=112 (10 obs) right son=113 (31 obs)
##   Primary splits:
##       Insulin          < 1.498564     to the right, improve=4.5476000, (0 missing)
##       PedigreeFunction < -0.462913    to the left,  improve=2.5892520, (0 missing)
##       BloodPressure    < 0.2012074    to the right, improve=1.6255160, (0 missing)
##       BMI              < 1.24394      to the left,  improve=0.9656574, (0 missing)
##       Glucose          < 1.144901     to the left,  improve=0.8797118, (0 missing)
##   Surrogate splits:
##       Glucose          < 1.3482       to the right, agree=0.805, adj=0.2, (0 split)
##       PedigreeFunction < -0.7194559   to the left,  agree=0.780, adj=0.1, (0 split)
##       Age              < -0.3180962   to the right, agree=0.780, adj=0.1, (0 split)
## 
## Node number 57: 54 observations
##   predicted class=1  expected loss=0.2777778  P(node) =0.08780488
##     class counts:    15    39
##    probabilities: 0.278 0.722 
## 
## Node number 94: 48 observations,    complexity param=0.01395349
##   predicted class=0  expected loss=0.4791667  P(node) =0.07804878
##     class counts:    25    23
##    probabilities: 0.521 0.479 
##   left son=188 (15 obs) right son=189 (33 obs)
##   Primary splits:
##       Insulin          < -0.5969897   to the right, improve=1.970455, (0 missing)
##       PedigreeFunction < -0.1535524   to the right, improve=1.968860, (0 missing)
##       BloodPressure    < -0.1087771   to the right, improve=1.756859, (0 missing)
##       Glucose          < -0.04361642  to the right, improve=1.678842, (0 missing)
##       Age              < 1.255002     to the right, improve=1.462607, (0 missing)
##   Surrogate splits:
##       SkinThickness    < -0.91125     to the right, agree=0.792, adj=0.333, (0 split)
##       PedigreeFunction < -0.2214608   to the right, agree=0.750, adj=0.200, (0 split)
##       Glucose          < 0.1127674    to the right, agree=0.729, adj=0.133, (0 split)
##       BMI              < 0.9648994    to the right, agree=0.729, adj=0.133, (0 split)
## 
## Node number 95: 24 observations
##   predicted class=1  expected loss=0.2083333  P(node) =0.03902439
##     class counts:     5    19
##    probabilities: 0.208 0.792 
## 
## Node number 112: 10 observations
##   predicted class=0  expected loss=0  P(node) =0.01626016
##     class counts:    10     0
##    probabilities: 1.000 0.000 
## 
## Node number 113: 31 observations,    complexity param=0.01085271
##   predicted class=1  expected loss=0.4516129  P(node) =0.0504065
##     class counts:    14    17
##    probabilities: 0.452 0.548 
##   left son=226 (22 obs) right son=227 (9 obs)
##   Primary splits:
##       BMI              < 1.231256     to the left,  improve=1.3346370, (0 missing)
##       Glucose          < 0.5506421    to the left,  improve=1.2798390, (0 missing)
##       PedigreeFunction < -0.462913    to the left,  improve=1.2476960, (0 missing)
##       BloodPressure    < 0.4078637    to the right, improve=1.2009930, (0 missing)
##       Insulin          < 0.63084      to the left,  improve=0.8765778, (0 missing)
##   Surrogate splits:
##       BloodPressure < 0.7953443    to the left,  agree=0.806, adj=0.333, (0 split)
##       SkinThickness < 1.251459     to the left,  agree=0.774, adj=0.222, (0 split)
##       Glucose       < 0.5819188    to the left,  agree=0.742, adj=0.111, (0 split)
## 
## Node number 188: 15 observations
##   predicted class=0  expected loss=0.2666667  P(node) =0.02439024
##     class counts:    11     4
##    probabilities: 0.733 0.267 
## 
## Node number 189: 33 observations,    complexity param=0.01395349
##   predicted class=1  expected loss=0.4242424  P(node) =0.05365854
##     class counts:    14    19
##    probabilities: 0.424 0.576 
##   left son=378 (12 obs) right son=379 (21 obs)
##   Primary splits:
##       BloodPressure    < 0.3045355    to the right, improve=2.2164500, (0 missing)
##       PedigreeFunction < -0.2833329   to the right, improve=1.4948380, (0 missing)
##       Age              < 1.340034     to the right, improve=1.4948380, (0 missing)
##       Glucose          < -0.3720224   to the right, improve=1.4429510, (0 missing)
##       Pregnancies      < 1.08469      to the right, improve=0.8864295, (0 missing)
##   Surrogate splits:
##       Age              < 0.8298403    to the right, agree=0.788, adj=0.417, (0 split)
##       BMI              < -0.4620122   to the left,  agree=0.727, adj=0.250, (0 split)
##       SkinThickness    < -0.3784087   to the right, agree=0.667, adj=0.083, (0 split)
##       PedigreeFunction < -0.6938016   to the left,  agree=0.667, adj=0.083, (0 split)
## 
## Node number 226: 22 observations,    complexity param=0.01085271
##   predicted class=0  expected loss=0.4545455  P(node) =0.03577236
##     class counts:    12    10
##    probabilities: 0.545 0.455 
##   left son=452 (11 obs) right son=453 (11 obs)
##   Primary splits:
##       Pregnancies   < -0.3991752   to the left,  improve=1.4545450, (0 missing)
##       BloodPressure < 0.2012074    to the right, improve=1.4545450, (0 missing)
##       Insulin       < -0.2368842   to the left,  improve=1.4545450, (0 missing)
##       BMI           < 0.0009413653 to the right, improve=1.3852810, (0 missing)
##       Glucose       < 0.5506421    to the left,  improve=0.7305195, (0 missing)
##   Surrogate splits:
##       Glucose       < 0.8164946    to the left,  agree=0.727, adj=0.455, (0 split)
##       BMI           < 0.5653642    to the right, agree=0.727, adj=0.455, (0 split)
##       SkinThickness < 1.094741     to the right, agree=0.682, adj=0.364, (0 split)
##       BloodPressure < 0.2012074    to the right, agree=0.636, adj=0.273, (0 split)
##       Insulin       < -0.2368842   to the left,  agree=0.636, adj=0.273, (0 split)
## 
## Node number 227: 9 observations
##   predicted class=1  expected loss=0.2222222  P(node) =0.01463415
##     class counts:     2     7
##    probabilities: 0.222 0.778 
## 
## Node number 378: 12 observations
##   predicted class=0  expected loss=0.3333333  P(node) =0.0195122
##     class counts:     8     4
##    probabilities: 0.667 0.333 
## 
## Node number 379: 21 observations
##   predicted class=1  expected loss=0.2857143  P(node) =0.03414634
##     class counts:     6    15
##    probabilities: 0.286 0.714 
## 
## Node number 452: 11 observations
##   predicted class=0  expected loss=0.2727273  P(node) =0.01788618
##     class counts:     8     3
##    probabilities: 0.727 0.273 
## 
## Node number 453: 11 observations
##   predicted class=1  expected loss=0.3636364  P(node) =0.01788618
##     class counts:     4     7
##    probabilities: 0.364 0.636
prp(CART)

predClassCART <- predict(CART, newdata = test.data, type = "class")
confusionMatrix(predClassCART, test.data$Outcome)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 88 15
##          1 12 38
##                                           
##                Accuracy : 0.8235          
##                  95% CI : (0.7537, 0.8804)
##     No Information Rate : 0.6536          
##     P-Value [Acc > NIR] : 2.528e-06       
##                                           
##                   Kappa : 0.605           
##                                           
##  Mcnemar's Test P-Value : 0.7003          
##                                           
##             Sensitivity : 0.8800          
##             Specificity : 0.7170          
##          Pos Pred Value : 0.8544          
##          Neg Pred Value : 0.7600          
##              Prevalence : 0.6536          
##          Detection Rate : 0.5752          
##    Detection Prevalence : 0.6732          
##       Balanced Accuracy : 0.7985          
##                                           
##        'Positive' Class : 0               
## 
predProbCART <- predict(CART, newdata = test.data, type = "prob")[,2]
CART.ROC <- performance(prediction(predProbCART, test.data$Outcome), measure="tpr", x.measure="fpr")
CART.AUC <- performance(prediction(predProbCART, test.data$Outcome), measure="auc")@y.values

RF

library(randomForest) 
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Note: the argument is ntree (not ntrees); 500 is also the default
RandomForest <- randomForest(Outcome ~ ., data=train.data, importance = TRUE, ntree = 500)
summary(RandomForest)
##                 Length Class  Mode     
## call               5   -none- call     
## type               1   -none- character
## predicted        615   factor numeric  
## err.rate        1500   -none- numeric  
## confusion          6   -none- numeric  
## votes           1230   matrix numeric  
## oob.times        615   -none- numeric  
## classes            2   -none- character
## importance        36   -none- numeric  
## importanceSD      27   -none- numeric  
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            14   -none- list     
## y                615   factor numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL     
## terms              3   terms  call
plot(RandomForest)

# Get the probability of Outcome = 1 from the second column
predProbRF <- predict(RandomForest, newdata = test.data, type = "prob")[,2]


# Get the predicted class
predClassRF <- predict(RandomForest, newdata = test.data, type = "response")

RF.ROC <- performance(prediction(predProbRF, test.data$Outcome), 
                       measure="tpr", x.measure="fpr")
RF.AUC <- performance(prediction(predProbRF, test.data$Outcome), 
                       measure="auc")@y.values

Question 2

Apply all classification models you learned in this class to this diabetes dataset to predict the Outcome based on the 8 different kinds of chemical components. Which classification method was the best? Use AUC to decide the best one.

Then, apply all classification models you learned in this class to this diabetes dataset to predict the Outcome based on the 8 different kinds of chemical components plus the results from one clustering method from the first part. Explain why you picked that one out of all the clustering methods. Which classification method was the best? Use AUC to decide the best one.

Does including results from clustering methods increase the accuracy rates?

We see from the model summaries that the cluster identified by k-means does not play a significant role in the logit or CART models, and CART is also the worst model in terms of AUC. However, the cluster variable is used in splits within individual trees of the random forest, which is our best-performing model. It is possible that there are interactions or correlations among predictors that the logit and CART models fail to capture but that the random forest averages over.
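One way to quantify the cluster feature's contribution would be to refit the random forest without it and compare test-set AUC. A sketch, assuming the cluster labels are stored in a column named `cluster` (the actual column name may differ):

```r
# Refit without the (hypothetical) cluster column and compare AUC on the test set
RF.noCluster <- randomForest(Outcome ~ . - cluster, data = train.data,
                             importance = TRUE, ntree = 500)
predProbRF.nc <- predict(RF.noCluster, newdata = test.data, type = "prob")[,2]
performance(prediction(predProbRF.nc, test.data$Outcome), measure = "auc")@y.values
```

If the AUC drops noticeably relative to the full model, the cluster labels are carrying predictive signal even though they look insignificant in the logit and CART summaries.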

Interpret the results from each classification method. What are the pros and cons?

# Plot our ROC performance with Logit as the base
plot(Logit.ROC, lwd=2, main = "ROC Comparison for Models on Test Dataset")

# Add SVM
lines(attributes(SVM.ROC)$x.values[[1]], attributes(SVM.ROC)$y.values[[1]], 
      col="red", lwd=2)

# Add KNN
lines(attributes(KNN.ROC)$x.values[[1]], attributes(KNN.ROC)$y.values[[1]], 
      col="blue", lwd=2)

# Add CART 
lines(attributes(CART.ROC)$x.values[[1]], attributes(CART.ROC)$y.values[[1]], 
      col="green", lwd=2)

# Add RF
lines(attributes(RF.ROC)$x.values[[1]], attributes(RF.ROC)$y.values[[1]], 
      col="brown", lwd=2)

#Add Legend
legend(x=.5, y=.6, c("Logistic (Step)", "SVM", "KNN", "CART", "RF"), 
       col=c("black", "red", "blue", "green", "brown"), lwd=c(2,2,2,2,2))
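A diagonal reference line makes it easier to read the curves against a no-skill classifier:

```r
# 45-degree line: the ROC curve of random guessing
abline(a = 0, b = 1, lty = 2, col = "gray")
```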

Which one is the best one among all classifiers?

Logit.AUC
## [[1]]
## [1] 0.8415094
SVM.AUC
## [[1]]
## [1] 0.8426415
KNN.AUC
## [[1]]
## [1] 0.8383962
CART.AUC
## [[1]]
## [1] 0.8117925
RF.AUC
## [[1]]
## [1] 0.8559434
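To make the comparison explicit, the five AUC values can be collected into one sorted table; a sketch using the AUC objects computed above:

```r
# Gather the AUCs into a data frame and sort descending
auc.table <- data.frame(
  Model = c("Logistic (Step)", "SVM", "KNN", "CART", "RF"),
  AUC = c(Logit.AUC[[1]], SVM.AUC[[1]], KNN.AUC[[1]],
          CART.AUC[[1]], RF.AUC[[1]])
)
auc.table[order(-auc.table$AUC), ]
```

By AUC, the random forest (0.856) edges out SVM (0.843) and the stepwise logistic model (0.842), with CART the weakest (0.812).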